Neotoma
|
This workshop is designed to use paleoecological data and associated internet resources to provide participants with the following:
Guidance on best practices in the archiving and analysis of paleovertebrate data
Training in the use of the Neotoma Paleoecology Database (www.neotomadb.org) to archive, access, and analyze paleoecological data.
Neotoma is a multiproxy paleo-database that stores a range of paleoecological & paleoenvironmental data, including vertebrate faunal data. One of the strengths of Neotoma is the ability to compare faunal data with other proxy data such as fossil pollen, diatoms, ostracodes, insects, charcoal, and geochemical data. In addition, the database is structured to relate absolute dates to taxon occurrences and to allow the creation and storage of age models built on absolute dates from stratigraphic sections. Neotoma is a public-access, community-supported database that is emerging as the standard repository for Pliocene and Quaternary paleoecological data.
More teaching materials can be found in Neotoma’s educational resources.
QUESTION 1: What is the latitude and longitude of New Paris #4? What is the Site ID?
QUESTION 2: How many total sites are found in that region?
QUESTION 3: In which states has Ernie worked?
QUESTION 4: How many sites have Antilocapra records?
QUESTION 5: What state has the easternmost location of Antilocapra in this search? (For comparison, the eastern range limit of Antilocapra today is in western Iowa).
QUESTION 6: The generation of fossil records is labor intensive and hence expensive - e.g. the costs of fieldwork, the money spent on radiocarbon dates, the time required for a trained analysis to identify specimens, etc. A rough time/cost estimate for a single vertebrate fossil record is on the order of two years and $30,000. Given this, give an order-of-magnitude estimate of the number of person-years and dollars it took to generate these fossil vertebrate records now stored in Neotoma. (Order-of-magnitude = 10 person-years? 100 person-years? 1000? etc.)
QUESTION 7: Does mastodon tend to live in places with spruce, or without spruce? Suggest two hypotheses that might explain the observed association (or lack of one).
QUESTION 8: Describe the history of Mammuthus distributions in North America over the last 21,000 years.
QUESTION 9: What other information would you want about this date to determine authenticity?
QUESTION 11: What is the difference between information in the “Chronology” tab of the Vertebrate dataset versus the information within the Geochronology dataset?
QUESTION 12: When was the last time Aplodontia rufa was seen at the site?
GET statements: values appended to a URL after a ? that are processed as parameters to be passed to the code running on the server. For example, we can think of http://google.com/?q=bananas as if it were an R function called search(), passed as search(q="bananas").Google has APIs for most of its products. You could write an R (or Python, etc.) script that would connect to the Google Calendar API to allow you to automatically change events on your calendar or report agenda items back to you in your own custom environment. Hilary Parker has a nice example for adding sunsets to Google Calendar using R.
APIs exist for many important biological and paleobiological databases. We will work through examples from GeoLocate, the georeferencing service, iDigBio, the Paleobiology Database, and Neotoma. Different databases/services produce different kinds of API returns. The simplest kinds are comma delimited text files, but many APIs are now returning JSON documents. We will begin with text and move on to JSON.
The full documentation for the Paelobiology Database API is located at https://paleobiodb.org/data1.2/
The PaleoBioDB API uses calls to different URLs to return different kinds of data. We’ll try some calls to the occurrence APIs. In your web browser, try typing:
https://paleobiodb.org/data1.2/occs/list.txt?base_name=Camelidae&interval=Pleistocene&show=loc,class
Note that this call is telling the API to search the base_name, “Camelidae”; the interval, “Pleistocene”; and to show the attributes loc,class. In this API, base_name searches for a taxonomic name at any level of the Linnaean hierarchy and returns any occurrences with that name, its synonyms, or its subtaxa. You should try changing to a name of interest to you and viewing the results.
Similarly, interval returns occurrences that fall within the specific geologic interval. These intervals include North American Land Mammal Ages as well as the conventional Geologic Timescale. Try “Blancan” and “Burdigalian” to see what happens. Try an interval of interest to you.
You can search by taxonomy, age, environment, even by the specific locality or occurrence ID in the database. The list of potential calls is here: https://paleobiodb.org/data1.2/occs/list_doc.html
You can narrow your results geographically by invoking lngmin and lngmax as well as latmin and latmax (remembering to separate parameters using an &). Try for a few minutes to explore the data with different calls, using the call list.
iDigBio (Integrated Digitized Biocollections) is the central resource for searching digital specimen data liberated by the NSF’s Advancing the Digitization of Biological Collections (ADBC) program. This program funds networks of collections centered on research themes (Thematic Collection Networks, TCNs) to enter specimen data into databases, to georeference collection locations, and to photograph those specimens. Most importantly, these data have to be made available online and connected to the iDigBio portal. iDigBio cleans up the data and provides several ways to access them, including the API we will be investigating.
First, try running this query:
There are several URL encoding tags in this statement. The %7B means {, %22 means ", the %3A means :, the %2C is a comma, and the %7D is }. The + means a space. So this statement translates as ?rq={"scientificname": "camelus bactrianus", "hasImage": true}&limit=5. Try changing some of the values and seeing what you get.
The format used to structure the ?rq= parameter above is called JSON, or JavaScript Object Notation. The output you see when you submit the API call is also in JSON. It is an important way to transmit the sort of complicated, nested data saved in relational databases when you have to send a flat file over the web, and a newer alternative to XML, which you may have seen before.
The documentation for the iDigBio API is at https://www.idigbio.org/wiki/index.php/IDigBio_API
At its most basic level, JSON transmits data objects in attribute-value pairs. It has come to replace XML, which was the previous standard for this sort of data transmission.
JSON is composed of objects, enclosed by curly brackets, which may have any number of attributes named in quotes, with values after a colon, separated by commas. You may also present an array, or an ordered collection of values, enclosed in square brackets.
As an example, you could present an occurrence like this example from the Neotoma API:
{
"SiteLongitudeWest": -103.31666666666666,
"SiteLatitudeSouth": 34.283333333333339,
"TaxonName": "Smilodon fatalis",
"VariableElement": "bone/tooth",
"Value": 1.0,
"VariableContext": null,
"TaxaGroup": "MAM",
"SampleAgeYounger": 15332.0,
"SampleAgeOlder": 30041.0,
"SiteLongitudeEast": -103.31666666666666,
"SiteAltitude": 1280.0,
"VariableUnits": "present/absent",
"DatasetID": 4564,
"SampleAge": null,
"SiteLatitudeNorth": 34.283333333333339
}
Can you tell what kind of occurrence this JSON object is describing? How old is it? Where is it located?
This record comes from searching the Neotoma DB API. Here is the example API call:
http://api.neotomadb.org/v1/data/sampledata?taxonname=Smilodon*
Notice when you make this call that the JSON is computer-friendly but not human-friendly. Try it again with the ‘pretty’ format tag:
http://api.neotomadb.org/v1/data/sampledata?taxonname=Smilodon*&format=pretty
You should be able to see the nested set of JSON objects, including the occurrences returned as comma-separated objects within "data".
Try experimenting with the search, substituting different names. We’ll do more with the Neotoma API in a bit.
JSON is becoming the standard for data transfer in web services. R has several packages for dealing with JSON-formatted data. We will use some examples from the package RJSONIO. We will also use the package RCurl, which has functions to let you query APIs from within the R environment. The neotoma package is built around the newer httr package.
# Uncomment this line if you haven't already installed any of these packages:
# install.packages(c("RCurl", "RJSONIO"))
# Add the packages to your programming environment
library(RCurl)
library(RJSONIO)
You will also want to make sure you have the latest version of R installed, or else the secure connection (https) won’t work in the following queries.
Now create a query for the PaleoBioDB API using the example from above:
q <- "https://paleobiodb.org/data1.2/occs/list.txt?base_name=Camelidae&interval=Pleistocene&show=loc,class"
Create an object to receive the results:
a <- basicTextGatherer()
And execute that query:
curlPerform(url = q, writefunction = a$update)
Finally, view the data:
a$value()
You can see that the data have come in as a character vector; a long list of text strings with no clear structure. Luckily, the PaleoBioDB API also has a JSON interface.
Change your query to refer to list.json:
q <- "https://paleobiodb.org/data1.2/occs/list.json?base_name=Camelidae&interval=Pleistocene&show=loc,class"
Rerun your query and look at a$value( ) again. How does it look now?
a <- basicTextGatherer()
curlPerform(url = q, writefunction = a$update)
a$value()
You can convert it to a data frame:
tmp <- fromJSON(a$value())
records <- tmp$records
results <- data.frame(records[1], stringsAsFactors = FALSE)
for (x in records[-1]) {
x <- data.frame(x, stringsAsFactors = FALSE)
results <- merge(results, x, by = intersect(names(results),names(x)), all = TRUE)
}
dim(results) # this shows the dimensions of the data frame
head(results) # using head() here only prints the top 6 rows, out of 424
When you have gotten this code to work, go through and make comments to describe what each section is doing. Remember, you can make comments in your R code by placing a hashtag (#) at the beginning of a line, or after a line has run (as with the last few lines in the above chunk of code).
Try playing with your queries to see what you can pull from the PaleoBioDB, or even extending to the other two APIs we have explored.
If you think back to the Neotoma API example, you can see that the example API call is reporting data from only part of the distributed database schema of Neotoma. In fact, the Neotoma API is designed around a set of different URLs, each of which allows a user to search a portion of the database. So, if you want the full Site information for locations with Smilodon present, you would have to search on Smilodon in the SampleData URI (as we did in the example), pull the DatasetID values from those returns, then search on the Dataset URI (http://api.neotomadb.org/v1/data/datasets) for those DatasetIDs, which would, in turn, produce the SiteIDs, which you would then search on the Sites URI (http://api.neotomadb.org/v1/data/sites). This sort of searching would be cumbersome if you were to do it by hand, but fortunately you can script a computer to do it for you. In fact, you don’t have to write the scripts to do it, because they have already been constructed and provided to the community as the R neotoma package.
The neotoma package will the the subject of our next module, but bear in mind that the other APIs also have wrapper packages to simplify data calls in R. The PaleoBioDB has a package, paleobioDB, and iDigBio has a package, ridigbio. Currently, a large group of collaborators is working on a single API and wrapping R package to access both PaleoBioDB and Neotoma at the same time, as well as linking to many online museum databases and iDigBio. We wanted to introduce you to the underlying architecture here so you would understand what these packages are doing, and would know that you can crack them open and hack your own solutions if you cannot get them to give you the data or format of data that you need for your work.
neotoma PackageInstall the neotoma package, then add it to your programming environment
# Uncomment this line if you haven't already installed any of these packages:
# install.packages(c("neotoma"))
# Note, because the `neotoma` package is under active development there may be very recent changes
# you may be interested in. To get those, uncomment the following lines:
# install.packages('devtools')
# devtools::install_github('ropensci/neotoma')
library(neotoma)
neotoma has three core commands: get_site, get_dataset, and get_download. The first two return metadata for sites and datasets; the latter returns data. See Goring et al. (Goring et al. 2015) for a full description of the package and example code. This exercise is partially based on those examples.
We’ll start with get_site. get_site returns a data.frame with metadata about sites. You can use this to find the spatial coverage of data in a region (using get_site with a bounding box), or to get explicit site information easily from more complex data objects. Use the command ?get_site to see all the options available.
You can easily search by site name, for example, finding “Samwell Cave”.
samwell_site <- get_site(sitename = 'Samwell%')
Examine the results
print(samwell_site)
## site.name long lat elev
## Samwell Cave -122.2379 40.91691 465
## A site object containing 1 sites and 8 parameters.
While samwell_site is a data.frame it also has class site, that’s why the print output looks a little different than a standard data.frame. That also allows you to use some of the other neotoma functions more easily.
By default the search string is explicit, but because older sites, especially pollen sites entered as part of COHMAP, often had appended textual information (for example (CA:British Columbia)), it’s often good practice to first search using a wildcard character. For example, searching for “Marion” returns three sites:
marion_site <- get_site(sitename = 'Marion%')
print(marion_site)
## site.name long lat elev
## Marion Lake (CA:British Columbia) -122.54722 49.30833 305
## Marion Landfill -83.18611 40.59167 NA
## Marion Lake -121.86241 44.55770 1259
## A site object containing 3 sites and 8 parameters.
You can also search by lat/lon bounding box. This one roughly corresponds to Florida.
FL_sites <- get_site(loc = c(-88, -79, 25, 30))
You can also search by geopolitical name or geopolitical IDs (gpid) stored in Neotoma. For a list of names and gpids, go to http://api.neotomadb.org/apdx/geopol.htm, or use the get_table(table.name = "GeoPoliticalUnits") command. This command works either with an explicit numeric ID, or with a text string:
# get all sites in New Mexico (gpid=7956)
NM_sites <- get_site(gpid = 7956)
# get all sites in Wisconsin
WI_sites <- get_site(gpid = "Wisconsin")
data.frame stores vectors of equal length. The nice thing about a data.frame is that each vector can be of a different type (character, numeric values, etc.). In RStudio, you can use the Environment panel in upper right to explore variables.
We pointed out before that the object returned from get_site is both a data.frame and a site object. Because it has a special print method some of the information from the full object is obscured when printed. You can see all the data in the data.frame using str (short for structure):
str(samwell_site)
Let’s look at the description field:
samwell_site$description
marion_site$description
The structure of the Neotoma data model, as expressed through the API is roughly: “counts within download, download within dataset, dataset within site”. So a dataset contains more information than a site, about a particular dataset from that site. A site may have a single associated dataset, or multiple. For example:
samwell_datasets <- get_dataset(samwell_site)
print(samwell_datasets)
get_dataset returns a list of datasets containing the metadata for each dataset
We can pass output from get_site to get_dataset, even if get_site returns multiple sites
marion.meta.dataset <- get_dataset(marion_site)
Let’s look at the metadata returned for the Marion% search:
marion.meta.dataset
Both Marion Lake (CA: British Columbia) and Marion Landfill have a geochronology dataset, while Marion Lake (CA: British Columbia) has a pollen dataset and Marion Landfill a vertebrate fauna dataset. The third site, Marion Lake, has a diatom dataset and a water chemistry dataset.
Question 13: Are Marion Lake (CA: British Columbia) and Marion Lake the same site? Try using the browse function.
Further searches (i.e., to examine the lat/longs) or consulting the literature would be required. This illustrates the caution needed when using this, or any other, database. Human judgement still needs to be exercised when examining results from databases. The get_publication() function can return citation information if neccessary, but often using the browse() function is very helpful in examining a database further.
get_download() returns a list that stores a list of download objects - one for each retrieved dataset. Note that get_download() returns the actual data associated with each dataset, rather than a list of the available datasets, as in get_dataset() above. Each download object contains a suite of data for the samples in that dataset.
Get data for all datasets at Samwell Cave. get_download() will accept an object of class dataset (ie, samwell_dataset), but also of class site, since it will automatically query for the datasets associated in each site. Compare the returned messages for the following two commands:
samwell_all <- get_download(samwell_site)
## Warning in get_download.default(datasetid, verbose = verbose): Some datasetids returned empty downloads, be aware that length(datasetid) is longer than the download_list.
samwell_all <- get_download(samwell_datasets)
## Warning in get_download.default(datasetid, verbose = verbose): Some datasetids returned empty downloads, be aware that length(datasetid) is longer than the download_list.
print(samwell_all)
There are a number of messages that appear. These should be suppressed with the flag verbose = FALSE in the function call. One thing you’ll note is that not all of the datasets can be downloaded directly to a download objct. This is because geochronologic datasets have a different data structure than other data, requiring different fields, and as such, they can be obtained using the get_geochron() function:
samwell_geochron <- get_geochron(samwell_site)
print(samwell_geochron)
The result is effectively the inverse of the first.
Get the vertebrate datasets for just Samwell Cave Popcorn Dome (dataset 14262):
samwell_pd <- get_download(14262)
## API call was successful. Returned record for Samwell Cave
Let’s examine the available data in this download
str(samwell_pd[[1]])
There are 6 associated fields:
Within the download object, sample.meta stores the core depth and age information for that dataset. We just want to look at the first few lines, so are using the head() function. Let’s explore different facets of the dataset
head(samwell_pd[[1]]$sample.meta)
#taxon.list stores a list of taxa found in the dataset
head(samwell_pd[[1]]$taxon.list)
#counts stores the the counts, presence/absence data, or percentage data for each taxon for each sample
head(samwell_pd[[1]]$counts)
If we have time, we can work through the example given the Neotoma package paper in Open Quaternary.
Jump to the section “Examples”, then scroll down to “Mammal Distributions in the Pleistocene”. The R code is reproduced below, with some modifications to accomodate changes since publication.
# install.packages('ggplot2','reshape2')
library("ggplot2")
library("reshape2")
# Bounding box is effectively the continental USA, excluding Alaska
mam.set <- get_dataset(datasettype= 'vertebrate fauna', loc = c(-125, 30, -115, 49.5))
# Retrieving this many sites can be very time consuming.
# If you are running code that takes a long time, but has an obvious output, you can stick it
# in an output file and call it later. This is a nice trick :)
# if ("mam_dl.rds" %in% list.files('objects')) {
# mam.dl <- readRDS('objects/mam_dl.rds')
# } else {
# mam.dl <- get_download(mam.set)
# saveRDS(mam.dl, file = 'objects/mam_dl.rds')
# }
githubURL <- "https://github.com/NeotomaDB/Workshops/blob/master/SVP2016/R/objects/mam.dl.RData?raw=true"
load(url(githubURL))
compiled.mam <- compile_downloads(mam.dl)
time.bins <- c(500, 4000, 10000, 15000, 20000)
mean.age <- rowMeans(compiled.mam[,c('age.old','age.young', 'age')], na.rm = TRUE)
interval <- findInterval(mean.age, time.bins)
periods <- c('Modern',
'Late Holocene',
'Early-Mid Holocene',
'Late Glacial',
'Full Glacial',
'Late Pleistocene')
compiled.mam$ageInterval <- periods[interval + 1]
mam.melt <- melt(compiled.mam, measure.vars = 11:(ncol(compiled.mam)-1), na.rm = TRUE, factorsAsStrings = TRUE)
mam.melt <- transform(mam.melt, ageInterval = factor(ageInterval, levels = periods))
mam.lat <- dcast(data = mam.melt, variable ~ageInterval, value.var = 'lat',
fun.aggregate = mean, drop = TRUE)[,c(1, 3, 5, 6)] # variable Late Holocene Late Glacial Full Glacial
# We only want taxa that appear at all time periods:
mam.lat <- mam.lat[rowSums(is.na(mam.lat)) == 0, ]
# Group the samples based on the range &direction (N vs S) of migration.
# A shift of only 1 degree is considered stationary.
mam.lat$grouping <- factor(findInterval (mam.lat[,2] - mam.lat[, 4], c(-11, -1, 1, 20)),
labels = c('Southward','Stationary','Northward'))
mam.lat.melt <- melt(mam.lat)
colnames(mam.lat.melt)[2:3] <- c('cluster','Era')
ggplot(mam.lat.melt, aes(x = Era, y = value)) +
geom_path(aes(group = variable, color =cluster)) +
facet_wrap(~ cluster) +
scale_x_discrete(expand = c(.1,0)) +
ylab('Mean Latitude of Occurrence') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
neotoma R package: https://github.com/ropensci/neotoma (also accessible via the NeotomaDB github)Goring, S., A. Dawson, G. L. Simpson, K. Ram, R. W. Graham, E. C. Grimm, and J. W. Williams. 2015. neotoma: A programmatic interface to the neotoma paleoecological database. Open Quaternary 1:2.
| Funded by:
National Science Foundation Directorate for Geosciences Division of Earth Sciences |
EarthCube Integrative Activities National Sciences Foundation |